{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "hideCode": false,
    "hidePrompt": false
   },
   "source": [
    "# RL Exercise 1 - Markov Decision Processes\n",
    "\n",
    "**GOAL:** The goal of the exercise is to introduce the Markov Decision Process abstraction and to show how to use Markov Decision Processes in Python.\n",
    "\n",
    "**The key abstraction in reinforcement learning is the Markov decision process (MDP).** An MDP models sequential interactions with an external environment. It consists of the following:\n",
    "- a **state space**\n",
    "- a set of **actions**\n",
    "- a **transition function** which describes the probability of being in a state $s'$ at time $t+1$ given that the MDP was in state $s$ at time $t$ and action $a$ was taken\n",
    "- a **reward function**, which determines the reward received at time $t$\n",
    "- a **discount factor** $\\gamma$\n",
    "\n",
    "More details are available [here](https://en.wikipedia.org/wiki/Markov_decision_process).\n",
    "\n",
    "**NOTE:** Reinforcement learning algorithms are often applied to problems that don't strictly fit into the MDP framework. In particular, situations in which the state of the environment is not fully observed lead to violations of the MDP assumption. Nevertheless, RL algorithms can be applied anyway.\n",
    "\n",
    "## Policies\n",
    "\n",
    "A **policy** is a function that takes in a **state** and returns an **action**. A policy may be stochastic (i.e., it may sample from a probability distribution) or it can be deterministic.\n",
    "\n",
    "The **goal of reinforcement learning** is to learn a **policy** for maximizing the cumulative reward in an MDP. That is, we wish to find a policy $\\pi$ which solves the following optimization problem\n",
    "\n",
    "\\begin{equation}\n",
    "\\arg\\max_{\\pi} \\sum_{t=1}^T \\gamma^t R_t(\\pi),\n",
    "\\end{equation}\n",
    "\n",
    "where $T$ is the number of steps taken in the MDP (this is a random variable and may depend on $\\pi$) and $R_t$ is the reward received at time $t$ (also a random variable which depends on $\\pi$).\n",
    "\n",
    "A number of algorithms are available for solving reinforcement learning problems. Several of the most widely known are [value iteration](https://en.wikipedia.org/wiki/Markov_decision_process#Value_iteration), [policy iteration](https://en.wikipedia.org/wiki/Markov_decision_process#Policy_iteration), and [Q learning](https://en.wikipedia.org/wiki/Q-learning).\n",
    "\n",
    "## RL in Python\n",
    "\n",
    "The `gym` Python module provides MDP interfaces to a variety of simulators. For example, the CartPole environment interfaces with a simple simulator which simulates the physics of balancing a pole on a cart. The CartPole problem is described at https://gym.openai.com/envs/CartPole-v0. This example fits into the MDP framework as follows.\n",
    "- The **state** consists of the position and velocity of the cart as well as the angle and angular velocity of the pole that is balancing on the cart.\n",
    "- The **actions** are to decrease or increase the cart's velocity by one unit.\n",
    "- The **transition function** is deterministic and is determined by simulating physical laws.\n",
    "- The **reward function** is a constant 1 as long as the pole is upright, and 0 once the pole has fallen over. Therefore, maximizing the reward means balancing the pole for as long as possible.\n",
    "- The **discount factor** in this case can be taken to be 1.\n",
    "\n",
    "More information about the `gym` Python module is available at https://gym.openai.com/."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import absolute_import\n",
    "from __future__ import division\n",
    "from __future__ import print_function\n",
    "\n",
    "import gym\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code below illustrates how to create and manipulate MDPs in Python. An MDP can be created by calling `gym.make`. Gym environments are identified by names like `CartPole-v0`. A **catalog of built-in environments** can be found at https://gym.openai.com/envs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hideCode": false,
    "hidePrompt": false
   },
   "outputs": [],
   "source": [
    "env = gym.make('CartPole-v0')\n",
    "print('Created env:', env)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reset the state of the MDP by calling `env.reset()`. This call returns the initial state of the MDP."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hideCode": false,
    "hidePrompt": false
   },
   "outputs": [],
   "source": [
    "state = env.reset()\n",
    "print('The starting state is:', state)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `env.step` method takes an action (in the case of the CartPole environment, the appropriate actions are 0 or 1, for moving left or right). It returns a tuple of four things:\n",
    "1. the new state of the environment\n",
    "2. a reward\n",
    "3. a boolean indicating whether the simulation has finished\n",
    "4. a dictionary of miscellaneous extra information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hideCode": false,
    "hidePrompt": false
   },
   "outputs": [],
   "source": [
    "# Simulate taking an action in the environment. Appropriate actions for\n",
    "# the CartPole environment are 0 and 1 (for moving left and right).\n",
    "action = 0\n",
    "state, reward, done, info = env.step(action)\n",
    "print(state, reward, done, info)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A **rollout** is a simulation of a policy in an environment. It alternates between choosing actions based (using some policy) and taking those actions in the environment.\n",
    "\n",
    "The code below performs a rollout in a given environment. It takes **random actions** until the simulation has finished and returns the cumulative reward."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def random_rollout(env):\n",
    "    state = env.reset()\n",
    "    \n",
    "    done = False\n",
    "    cumulative_reward = 0\n",
    "\n",
    "    # Keep looping as long as the simulation has not finished.\n",
    "    while not done:\n",
    "        # Choose a random action (either 0 or 1).\n",
    "        action = np.random.choice([0, 1])\n",
    "        \n",
    "        # Take the action in the environment.\n",
    "        state, reward, done, _ = env.step(action)\n",
    "        \n",
    "        # Update the cumulative reward.\n",
    "        cumulative_reward += reward\n",
    "    \n",
    "    # Return the cumulative reward.\n",
    "    return cumulative_reward\n",
    "    \n",
    "reward = random_rollout(env)\n",
    "print(reward)\n",
    "reward = random_rollout(env)\n",
    "print(reward)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**EXERCISE:** Finish implementing the `rollout_policy` function below, which should take an environment *and* a policy. The *policy* is a function that takes in a *state* and returns an *action*. The main difference is that instead of choosing a **random action**, the action should be chosen **with the policy** (as a function of the state)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def rollout_policy(env, policy):\n",
    "    state = env.reset()\n",
    "    \n",
    "    done = False\n",
    "    cumulative_reward = 0\n",
    "\n",
    "    # EXERCISE: Fill out this function by copying the 'random_rollout' function\n",
    "    # and then modifying it to choose the action using the policy.\n",
    "    raise NotImplementedError\n",
    "\n",
    "    # Return the cumulative reward.\n",
    "    return cumulative_reward\n",
    "\n",
    "def sample_policy1(state):\n",
    "    return 0 if state[0] < 0 else 1\n",
    "\n",
    "def sample_policy2(state):\n",
    "    return 1 if state[0] < 0 else 0\n",
    "\n",
    "reward1 = np.mean([rollout_policy(env, sample_policy1) for _ in range(100)])\n",
    "reward2 = np.mean([rollout_policy(env, sample_policy2) for _ in range(100)])\n",
    "\n",
    "print('The first sample policy got an average reward of {}.'.format(reward1))\n",
    "print('The second sample policy got an average reward of {}.'.format(reward2))\n",
    "\n",
    "assert 5 < reward1 < 15, ('Make sure that rollout_policy computes the action '\n",
    "                          'by applying the policy to the state.')\n",
    "assert 25 < reward2 < 35, ('Make sure that rollout_policy computes the action '\n",
    "                           'by applying the policy to the state.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "hide_code_all_hidden": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
